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Abstract 

We address in this paper the co-clustering and co-classification of bilingual data lay¬ 
ing in two linguistic similarity spaces when a comparability measure defining a mapping 
between these two spaces is available. A new approach that we can characterized as a 
three-mode data analysis scheme, is proposed to mix the comparability measure with the 
two similarity measures. Our aim is to improve jointly the accuracy of classification and 
clustering tasks performed in each of the two linguistic spaces, as well as the quality of 
the final alignment of comparable clusters that can be obtained. We used first some 
purely synthetic random data sets to assess our formal similarity-comparability mixing 
model. We then propose two variants of the comparability measure that has been de¬ 
fined by (Li & Gaussier, 2010) in the context of bilingual lexicon extraction to adapt it 
to clustering or categorizing tasks. These two variant measures are subsequently used to 
evaluate our similarity-comparability mixing model in the context of the co-classification 
and co-clustering of comparable textual data sets collected from Wikipedia categories for 
the English and French languages. Our experiments show clear improvements in clustering 
and classification accuracies when mixing comparability with similarity measures, with, 
as expected, a higher robustness obtained when the two comparability variant measures 
that we propose are used. We believe that this approach is particularly well suited for the 
construction of thematic comparable corpora of controllable quality. 


1. Introduction 

Parallel corpora are sets of tuples of aligned documents that are formed with texts placed 
alongside with their translation(s). If such resources are of great utility in particular in 
the field of assisted translation or multilingual information retrieval, they are expensive to 
develop and often difficult to transpose from a specialty domain to another. The notion 
of comparable corpora has emerged in the nineties to palliate this lack of versatility and 
expensiveness and to offer avenues to a wider scope of applications such as multilingual 
terminology extraction, multilingual information retrieval or knowledge engineering (Baker, 
1996), (EAGLES, 1996). However, the notion of comparability between documents expressed 
in different languages is not easy to introduce: it is widely admitted that two documents 
in different languages are comparable when they share analogous criteria of composition, 
genre and topics. The term of comparable corpora was introduced by (Eung &: Yee, 1998), 
(Munteanu, Fraser, & Marcu, 2004) and remains quite subjective. (Dejean & Gaussier, 
2002) proposed a quantitative definition of the concept of comparability according to which 
"Two corpora in two languages Ci and C 2 are called comparable if there is a significant 
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sub-part of the vocabulary of the Ci language corpus, respectively C 2 language corpus, 
whose translation is in the corpus of language £ 2 ^ respectively Ci" (Li & Gaussier, 2010) 
have then derived a quantitative measure that is based on a bilingual translation dictionary. 
This measure consists primarily in counting the presence of the translations of dictionary 
entries that occur in the paired documents. It depends in a non-explicit way upon jointly 
the coverage of the bilingual translation dictionary and the studied corpora themselves. 

This comparability measure defined for bilingual corpora indeed applies when dealing 
with monolingual documents that partition in two distinct linguistic spaces, as far as a 
bilingual dictionary connecting the two spaces is available. At a document level we thus 
face a situation where monolingual similarity measures exist in each linguistic space that 
are potentially linked by a comparability measure. In the scope of the construction of 
thematic comparable corpora, this leads to address the co-classification or co-clustering of 
bilingual data since we are targeting the mapping of highly comparable clusters of documents 
that are furthermore thematically coherent in each linguistic space, i.e. characterized by a 
high intra-similarity. We confront such situation when harvesting multilingual data from 
the web for instance. With the need for comparable resources getting pressing, approaches 
that exploit consistently similarity and comparability measures are becoming particularly 
useful. 

There is apparently no existing direct method available to map comparable clusters of 
documents that lay in two different linguistic spaces. Nevertheless, there exist some work 
which is somehow related to this problem, like biclustering, co-clustering, or two-mode 
clustering introduced by (Mirkin, 1996) and (Van Mechelen I, 2004). However, these works 
are mainly relevant to the clustering of the rows and columns (instances and features axes) 
of a given matrix and does not fit with the sort of three-mode categorization or clustering 
we are facing. 

Recently, (Jagarlamudi, Daume, & Udupa, 2011a), (Jagarlamudi, Udupa, Daume, k 
Bhole, 2011b) have developed quite successfully a supervised method that learns interlingual 
representations from aligned training documents. They exploit word association measures 
and bilingual dictionary to remove noisy pairs of aligned documents. In (Amini k Goutte, 
2010) the authors proposed to learn a co-classification from multi-lingual corpora, based on 
a co-regularization of the categories in order to maintain a consistency of the categorization 
process across languages. (Li, Gaussier, k Aizawa, 2011) have proposed a solution for 
clustering bilingual corpora by using the comparability measure only. 

However, if our approach also seeks the joint clustering or classification of data that lay 
in two distinct linguistic spaces, it aims at exploiting, in conjunction with a comparability 
mapping existing between the two spaces, native similarity measures (a native similarity 
has to be understood as any quantitative intra-language similarity measure, such as a cosine 
similarity measure) existing within these two linguistic spaces. More precisely, the proposed 
approach is lying between the work reported in (Amini k Goutte, 2010) and (Li et ah, 2011). 
It exploits directly, i.e. without any learning phase, the comparability measure that maps 
the two linguistic spaces to provide new similarity measures that combine native similarity 
measures with a similarity measure that is induced by the comparability mapping. 

Thus the approach that we develop in the following sections only rely on a bilingual 
dictionary and does not assume that any aligned data preexist as learning data. Indeed, this 
approach could be enriched using a feature-extraction technique, such as the one proposed 
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in (Vu, Aw, &: Zhang, 2009) for instance, to align bilingual documents that have a similar 
content. 

After introducing our main motivations, we develop a straightforward mixing model 
to combine similarity and comparability measures in an efficient way that allows for the 
development of consistent co-clustering and co-classification of comparable data and assess 
it on purely synthetic data. We then address the concept of comparability for mapping 
bilingual textual data, and define, from the original measure proposed by (Li et ah, 2011), 
two alternative variant measures to overcome some limitation of the original measure. To 
assess the proposed approach on real textual data, we then detail an experimentation based 
on a subset of comparable documents collected from some Wikipedia categories. Basically, 
we evaluate jointly the three tested comparability measures and the proposed similarity- 
comparability mixing model in the scope of co-classification and co-clustering of bilingual 
data. Finally we discuss our results and draw some perspectives. 

2. Motivations: similarity spaces connected by a comparability mapping 

When confronting with complex data one may encounter situations where two distinct spaces 
S and S', in which preexist native similarity measures Ss and Ss', are interconnected by a 
mapping Css'- Figured] gives an example of such situation. This is the case when considering 
comparable corpora that are composed with texts written in at least two distinct languages. 
For such data, a bilingual dictionary allows for the construction of a comparability measure 
(Li & Gaussier, 2010) yielding to the definition of a comparability mapping (Marteau &: 
Menier, 2013) that links the two sets of comparable documents. More generally speaking, 
such case arises in situations where heterogeneous but analogous data is available, through 
different sources, in different formats, or characterized using different sets of descriptors, or 
comply to different semantic models such as heterogeneous ontologies for instance, etc. The 
principle of mapping heterogeneous but comparable data that we address is quite general 
since it takes the form of any bipartite weighted undirected graph. We call it a comparability 
mapping. Hence, a comparability mapping establishes a bi-directional connection between 
the elements of the two similarity spaces that could be used to challenge native similarity 
measures (or distances) defined in the two spaces. By doing so, we introduce a kind of three 
mode data analysis scheme: the two first modes are associated to the two native similarity 
spaces, while the third mode is related to the comparability mapping itself that links these 
two spaces. 

As an example, in figure [T| two discrete sets of elements S and S' are presented. We 
suppose that the notion of native similarity between elements of these sets is defined, we 
call them respectively Ss{., .): 5 x 5 —>• M and Ssi{., .): S' x S' ^ M. 

Furthermore, the two sets are point-wisely connected by a mapping defined by a compa¬ 
rability measure (.,.): 5 X 5' —>■ M. This mapping that takes the form of a bipartite 
graph is a comparability mapping. The edges of this graph are bidirectional and weighted 
by a real value that can be bounded into [—1,1]. 

The main idea that we develop in this article is that of similarity induced by a compa¬ 
rability mapping: in other words, if two elements in the set <S are mapped to a same subset 
of elements in the set S', then their similarity should be important (and vice versa), a 
contrario, if two elements in the set S' are mapped to disjoint subsets of elements in the set 
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Figure 1: Two similarity spaces connected by a comparability mapping. 


S, then their similarity should be small (and vice versa). Thus, in figure [T] from the point 
of view of the similarity derived from the comparability mapping alone, element 63 should 
move away from element 62 to get closer to elements 64 and 65 . Similarly, element eg should 
move away from elements e'3, 64 and 65 . The expected utility of such a similarity induced 
by a comparability mapping is a kind of noise hltering capability. When exploited in con¬ 
junction with native similarity measures in S and S' a fusion of complementary sources of 
knowledge is achieved that could help building more robust similarity functions into S and 
S' spaces. The noise in question could have many sources, in particular it could be inherent 
to the representational models of the element themselves due to a lack of knowledge, e.g. 
lack of structural variability, data heterogeneity, semantic ambiguities, etc. 

3. Combining similarity and comparability: a three-mode analysis scheme 
3.1 Similarity measure induced by a comparability mapping 

In this line of work, (Marteau & Menier, 2013) proposed an algorithm, Hit-ComSini, to 
iteratively construct the concept of similarity induced by a comparability bipartite graph. 
Unfortunately, this algorithm does not scale well due to its high algorithmic complexity (in 
0{N‘^)). We propose here a much more straightforward approach that consists in exploiting 
directly the comparability matrix constructed from the two bilingual finite collections of 
documents. 

Let us consider S and S' two collections of documents belonging to two distinct linguistic 
spaces {C and CJ respectively) in which two native similarity measures Ss and Sg' are 
defined. Let C (.,.) : Sg 'X Sgi —> 7^ be the comparability function that maps the two finite 
collections or equivalently that defines a weighted bipartite graph between the two linguistic 
spaces. The two similarity functions Sg, Sg/ and the comparability measure C allows for 
the definition of the following three-mode analysis scheme. 
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We define the similarity measures induced by the comparability mapping C as the fol¬ 
lowing two normalized (in [— 1 , 1 ]) measures respectively noted Ssi,c Ss'fi- 

\/{di,dj) £ 5^ and y{d[,d'-) £ 


Q ./I 

' ^C^Cii,i)C^Cij,j) 

The interpretation of the similarity measures that are induced by a comparability map¬ 
ping C is straightforward. First, considering each row i of the C matrix as a feature vector 
that characterizes document di £ S, for any {di,dj) £ S, CC'^{i,j) can be interpreted as 
an inner product between the two feature vectors representing di and dj respectively. Then, 
Ss,cidi,dj) is nothing but a cosine similarity between documents di and dj based on the 
comparability mapping only. 

Similarly, considering each column i of the C matrix as a feature vector that characterizes 
document d[ £ S', Ss',c{d'i,d'j) is nothing but a cosine similarity between documents d'^ and 
d'j £ S' based on the comparability mapping only. 

3.2 Mixing native similarity and induced similarity 

The comparability/similarity mixing model that we propose is a simple linear combination 
of the native and induced similarity measures defined in each linguistic space. Basically we 
use a single parameter a £ [ 0 , 1 ] to combine linearly the two measures as follows 

S'g{di,dj) = aSs,c{di,dj) -h (1 - a)Ss{di,dj) , . 

= aSs>,cK,d'^) + (1 - a)Ss'K,d'^) ^ ’ 

Since the induced similarity measures are normalized into the interval [—1, 1 ], we advo¬ 
cate using a cosine similarity as native similarity measures in the two connected linguistic 
spaces such that the mixed similarity measures defined by equation [ 2 ] are consistent. 

Finally, as this model mixes two sources of native similarity with the induced similarity 
measures that are directly derived from the comparability mapping, it implements the so- 
called three-mode data analysis scheme that we were referring to in the motivation section. 

4. Experimenting on synthetic data 

To evaluate the effectiveness of the proposed similarity-comparability mixing model, we 
generated 20 distinct tests by randomly defining: 

• two similarity spaces S and S', 

• a categorization of the elements within each of these spaces, 

• the comparability mapping between them. 
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The algorithm [T] describes the way each of these 20 tests is generated. The variance 
parameters Vg and Vc are set up such that the categories are significantly overlapping making 
the classihcation problems difficult enough. To put more discriminative weight on the native 
similarity measures, the variance Vc associated to the comparability mapping matrix that 
is used to provide the induced similarity measures is three times the variance Vg used for 
producing the native similarity measures. 


Algorithm 1 Random generation of two native similarity spaces connected by a compa¬ 
rability mapping. The algorithm provides two random similarity matrices Sg and Sg/, the 
random comparability mapping matrix Cg^g', two sets of comparable clusters associated to 
a cluster map, mapC, also randomly dehned on spaces S and S'. 

// Randn{n, m) returns an n-by-m matrix containing pseudo-random values drawn 
// from a normal distribution with mean zero and standard deviation one. 

// Randni) returns a single value from the previous distribution. 

0) R, = 1.0; Re = 3.0; 

1) Randomly select the number of clusters (resp. n^,) in S (resp. S') from the set 
{3,..., 18}; 

2) For each cluster in S (resp. cj in <S') randomly select the number of elements in Cfc, 
\ck\ (resp.c(, |c^|) from the set {20, ..,40}; 

3) For each pair of elements {ei,ej) in S^ (resp. <S'^) 
if Ci and ej belong to the same cluster then 

Sg{ei, ej) = 0.5 -|- R 5 * Randn{)] 
resp. Sg/{ei,ej) = 0.5 -|- Rs * Randn0\ 
else 

Sg{ei, Cj) = -0.5 -I- R^ * Rand{)] 
resp. Sgi{ei,ej) = —0.5 -|- R 5 * RandQ] 

end if 

4) mapC = Randn{n‘^,n'^,)] 

J = 0 

for A: = 1 : do 

7 = 0 ; 

for 1 = 1: Ug, do 
for z = 1 : |c;| do 
for j = 1: \ck\ do 

Cg,g'{I + i,J + j) = randni) * Rc -|- mapC{l, k)] 

end for 
end for 

i = i + Wi\; 
end for 
J >7 -l-1 c/j I, 

end for 


Table [U gives for each similarity spaces S and S' the number of elements and the number 
of clusters for each of the 20 tests. 
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S: 1-NN, 10 foldscCross validation 



S': 1-NN, 10 folds cross validation 



a 


Figure 2: Comparability/similarity mixing effect on the 1-NN classification task, according 
to a 10-fold cross validation procedure. The error rates in % are given when the 
native similarity alone is used (black square curve) when the induced similarity 
alone is used (red triangle curve) and when the mixing model is used as a function 
of parameter a € [0,1] (blue circle curve). Top 5, bottom S' similarity spaces 
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5 

5' 

5 

5' 

#clusters 

^elements 

^clusters 

^elements 

^clusters 

^elements 

^clusters 

^elements 

15 

491 

7 

189 

15 

411 

11 

314 

12 

355 

5 

148 

15 

524 

12 

343 

16 

485 

10 

309 

6 

169 

17 

488 

16 

516 

8 

243 

8 

223 

14 

420 

9 

287 

17 

486 

18 

546 

5 

147 

8 

247 

11 

322 

13 

415 

12 

385 

18 

540 

14 

405 

11 

347 

16 

499 

14 

445 

12 

386 

11 

366 

6 

191 

8 

246 

17 

540 

17 

547 

17 

528 

18 

565 

4 

136 

5 

160 

16 

456 


Table 1: Number of clusters and number of total elements in each synthetic similarity spaces 
S and S' for the twenty tests used for this experiment. 


To evaluate the effectiveness of a 1-NN classification as the mixing parameter a varies, 
we use the classification error rate measure. The mean and the variance for this measure 
are estimated on the basis of the 20 tests and the 10-fold cross validation. 

Figure [2] gives the mean and variance of the error rates for the 20 tests obtained when 
a 10 fold cross validation is performed using a 1-NN classifier. As shown in this hgure the 
1-NN classification using the induced similarity measures alone performs the worse, which 
was expected since the variance on the comparability mapping is three times the one used 
to generate the native similarity measures. The error rate is thus 46%in S and in S' when 
the induced similarity measures alone are used and 35% in S and 39% in S' when the native 
similarity measures alone are used. 

The effect of mixing native and induced similarity measures is strongly effective on these 
synthetic data sets since for both spaces the error rates drop below 25% and reach a minimum 
when a = 0.75. Note that the variance of the mean error decrease slightly when the mixing 
parameter a is around this optimal value 0.75. This experiment shows that even when the 
native and induced similarity measures are significantly noisy, the combination of the two 
sources of information allows for a significant reduction of the noise. 

This is precisely this effect that we would like to show on real bilingual comparable data, 
when the comparability mapping is elaborated from a bilingual lexicon. 

5. Variations around a quantitative comparability measure for bilingual 
texts 

5.1 Comparability measure by Li and Gaussier (Clg) 

The quantitative comparability measure proposed by (Li k Gaussier, 2010) is based on the 
simple counting of word translation connections that exist between two corpora in different 
languages according to a translation lexicon. Formally, let 5i and S 2 be two corpora ex- 
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pressed respectively in language Ci and C 2 - This comparability measure is formally defined 
as: 


^ a{wi)+ a{w2) 
f' ^fc o'!_ uiiGW<Sinvpz)i 'W2&ws2r(WD2 

“ |iT 5 i n WDi\ + \WS2 n wd2\ 


( 3 ) 


where: WSi, i G {1,2} is the lexicon in language Ci associated with the corpus Si] WDi 
is the set of entries for language Ci into the bilingual dictionary that occur in WSi] is 

an indicator function that takes the value 1 if at least one potential translation of the term 
Wi G WSi in language Ci exists in the vocabulary associated with the corpus of the other 
language, 0 otherwise. 

This measure was originally designed for a bilingual lexicon extraction purposes, and 
not for the clustering or categorization of textual data. Hence, the authors did not in¬ 
corporate any term weighting since it is a priori irrelevant for a lexicon extraction task. 
However, if their definition is in line with a general definition of comparability such as the 
one given in introduction, the lack of term weighting is questionable when addressing a 
clustering/categorization task. The two variants that we propose hereinafter introduce a 
term weighting based on the number of term occurrences to specifically adapt the measure 
defined by (Li &: Gaussier, 2010) to clustering or categorizing tasks. 

5.2 Enrichment of the Clg measure 

The Clg measure proposed by Li and Gaussier (eql3|) takes account of neither the number 
of occurrences of the lexical entries in the documents nor their number of translations into 
the paired documents. The binary presence or absence of joint translation entries that is 
modeled by the indicator function a{wi) is a strong feature that may affect the average 
comparability between pairs of documents. This could be the case when addressing corpora 
for which frequency of lexical entries helps discriminating between genres and topics. We 
propose the following two similar variants of the Clg measure that explicitly propose to 
go beyond the presence or absence of joint translations, conjecturing that this improvement 
will produce a positive effect in certain situations and tasks. 

5.2.1 First variant : CvAi 

The first variant symmetrically exploits (from the stand point of Ci and C 2 languages) the 
following three elements: the number of occurrences of entries w taken into the vocabulary 
of the first language corpus, the number of their translations in the bilingual dictionary and 
the presence of at least one of their translations in the vocabulary of the second language 
corpus. 

Let Aij 2 , Ai, H 211 , A 2 be defined as follows: 
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^l |2 — 


E 

widW S\r\W D\ 

■^1= E 

w\£W SiOW Di 

^ 2|1 = X] 

W2^W S2C\W D 2 

A2= ^ 

W2GWS2nlVD2 


tf(wi,Si) 
t(wi,WDi) 
tf(wi,Si) \ 
t(wi,WDi)J 
tf(w2,S2) 
t(w 2,WD2) 
tf(w2,S2) \ 
t(w2,WD2)J 


a(wi) 


cr{w2) 


where tf{wi,Si) is the number of occurrences of entry Wi in the corpus Si expressed in 
language Ci, i € {1,2}; T{wi, WDi) is the number of translations of entry wt of the corpus 
Si in the dictionary WDi] a{wi) is defined as above. 


CvAi = 


1 


A 


1|2 




+ 


2|1 


Ai A 2 


( 4 ) 


5.2.2 Second variant : Cva2 

This second variant is very similar to the previous one. It distinguishes mainly on the way 
the measure is symmetrized. Basically the first variant relates to a geometric mean while 
the second variant relates to an arithmetic mean. 


CvA2 


Ai\2 + ^ 2|1 

Ai + A 2 


( 5 ) 


6. Experimenting on textual bilingual data 

We have collected the assessment corpora from 21 Wikipedia categories, from English (EN) 
and French (FR) languages. It originally consists of 154828 documents in total with 87793 
English documents and 67035 French documents categorized in 21 categories, taken from 
existing Wikipedia categories. Since such corpus is thematically very large, corresponding 
similarity and comparability matrices are basically very sparse. To avoid the algorithmic 
complexity behind the calculation of the induces similarity matrices {0{N^)), we proceeded 
as follows which drastically reduces the sparsity of our matrices: 


1. For each class and each language, we evaluate firstly the intra-language similarity 
matrices, using a cosine similarity based on a tf — idf weighting, 

2 . secondly, we prune these intra-language similarity matrices using a threshold (typically 
0.5) and order the documents according to their number of remaining neighbors (with 
whom they share a similarity above the threshold). 

3. by keeping for each language the best hundred documents, we get a refined comparable 
bilingual corpus. 
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4. Finally, to complexify the experiment, we enrich this corpus by adding, for each lan¬ 
guage, and for each class, 50% of the initial number of documents. These added 
documents are randomly drawn from the initial 21 Wikipedia categories. 

Each Wikipedia article is then represented by its plain textual content, tags and hyperlink 
have thus been removed. This Wikipedia corpu^ contains 5822 documents in total, and is 
composed with 2745 French documents and 3077 English documents distributed into the 21 
categories as listed in Table [2j 


EN categories 

# doc 

FR categories 

# doc 

EN categories 

# doc 

FR categories 

# doc 

Astronomy 

151 

Astronomie 

123 

Movie 

151 

Film 

151 

Biology 

151 

Biologie 

115 

Music 

151 

Musique 

151 

Economy 

144 

Economie 

151 

Skating 

151 

Patinage 

151 

Food 

147 

Nourriture 

4 

Heritage 

151 

Patrimoine 

151 

Football 

151 

Football 

151 

Politics 

151 

Politique 

151 

Genetics 

82 

Genetique 

151 

Religion 

150 

Religion 

133 

Geograpphy 

139 

Geographic 

151 

Rugby 

151 

Rugby 

151 

Gomputer 

151 

Ordinateur 

151 

Health 

151 

Sante 

63 

Literature 

150 

Litterature 

151 

Sculpture 

151 

Sculpture 

151 

Mathematics 

151 

Mathematique 

63 

Tennis 

151 

Tennis 

151 

Medicine 

151 

Medecine 

130 






Table 2: Composition of the comparable bilingual corpus extracted from Wikipedia (EN: 
English, FR: French) 


This corpus has been lemmatized using the TreeTagger (Schmid, 1994) (Schmid, 2009). 

Stoplists for French and English languages have been used and the term frequencies (tf) for 
each vocabulary entry/document pair have been evaluated, as well as the inverse document 
frequencies idf (Sparck Jones, 1972) that were estimated on the corpus. Each Wikipedia 
article is finally represented by a tf-idf weighted vector according to the classical vector 
space model (Salton, Wong, Sz Yang, 1975). 

6.1 Bilingual dictionary 

To estimate the quantitative comparability between a pair of Enlish/French documents we 
have used the bilingual dictionary available at ELRA under reference ELRA-M0033. This 
dictionary contains 243,580 pairs of lexical entries in French and in English, which decompose 
into 110,541 lexical entries in English and 109,196 lexical entries in French. 

The influence of the dictionary coverage rate has been partially studied in (Li & Gaussier, 

2010) and (Ke, Marteau, &: Menier, 2014). It is shown that, for all three comparability 
measures ClGj CvAi and CvA 2 ^ fhe correlation of these measures with a gold standard 
comparability measure reference degrades when the dictionary coverage rate relatively to 
the corpus lexicon decreases. We do not address this issue in this paper, keeping in mind 

1. The Wikipedia corpus is available at http: //people. irisa.fr/Pierre-Francois .Marteau/Corpora/Wikipedia_21 classes .zi 
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that an enrichment of the bilingual dictionary by including in particular domain dependent 
bilingual terminology entries would likely greatly improve our results. 


6.2 Evaluation measures 

The performance of the 1-NN classifier is evaluated using the classification error rate esti¬ 
mate using a 10-fold cross validation. The performance of the tested clustering algorithms 
are also evaluated by comparing the predicted label for each document with its true label. 
The accuracy (AC) and normalized mutual information (NMI) measures are used to evalu¬ 
ate the clustering performance (Wei Xu k Gong, 2003). As an internal evaluation scheme 
for estimating the quality of the clustering obtained in each linguistic space, we also use the 
Davies-Bouldin index (DB) (Davies & Bouldin, 1979) which roughly measures the quotient 
of intra and inter cluster average similarity measures. 


The accuracy (AC) measure is dehned as follows: it measures the fraction of documents 
that are correctly labeled, assuming a one-to-one correspondence between true categories 
and assigned clusters. Let p denote any possible permutation of index set of clusters and 
true categories. The Accuracy is thus defined as 

AC = ^MAXp ^ (6) 

i=l-K 

where denotes the number of documents shared by class i and cluster p{i), K is the 

number of categories and clusters, and N is the total number of documents. 


The NMI measure between the true clustering C and the predicted one C is defined as 
follows: 


NMI{C,C) 


I{C,C) 

{H{C) + HiC))/2 


( 7 ) 


with 


nc,c) = 

and 

H{C) = -Y,P{ck) log P{ck) 

k 

H{C) = Pick) log P{ck) 

k 

The Davies-Boulding index DB is a data intrinsic evaluation measure, which is defined 
as follows 


DB = - 



( 8 ) 
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where K is the number of clusters, is the centroid of cluster, is the average distance 
of all elements in cluster k to centroid Cfc, and d{ci,Cj) is the distance between centroids i 
and j. The lower is this DB index value, the better is the clustering since this corresponds 
to low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances 
(low inter-cluster similarity). 

7. Experiments 

On the basis of the categorized comparable corpora collected from Wikipedia, we assess 
the beneht of mixing native similarity measures with comparability on a 1-NN classification 
task and on a k-medoid clustering (Kaufman & Rousseeuw, 1987) (Kaufman & Rousseeuw, 
1990) task. 

7.1 1-NN classification task 

We first study the effect of mixing similarity and comparability on the 1-NN classification 
task while varying the parameter a G [0,1]. 

Figures [3] and S] show that the similarity/comparability mixing has a significant impact 
for the two variants CyAi and CvA 2 since it allows reducing by 3% the error rate of the 
classification for the English language documents and 1.5% for the French language docu¬ 
ments. However, comparatively, the Clg measure improves slightly for both languages the 
classihcation accuracy, and is less stable when a varies. 

7.2 k-medoids clustering task 

We study here the effect of mixing comparability and similarity measures on a k-medoids 
clustering task for all three comparability measures. We used the previously defined AC, 
NMI and DB measures for the assessment of this clustering task. 

Figures [5] and El show that both AC and NMI measures can be improved up to 15% in 
the scope of the clustering of French language documents and up to 3% in the scope of the 
clustering of English language documents for both CvAi and CvA 2 measures. However, once 
again, the Clg brings comparatively less improvement for both languages. 

Figure [7] depicts the DB measure as a function of parameter a, for all three compara¬ 
bility measures. It is shown that, for CvAi and CvA 2 i this ratio decreases for some good a 
values, especially for the French language, whereas for the measure Clg^ this value increases 
in general. A good mixing of the comparability and similarity measures has thus a positive 
impact when using CyAi and CvA 2 nieasures and a rather negative impact when using the 
Clg measure. 
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1 nn leave one out tf_21 classes 



Figure 3: Comparability/similarity mixing effect on the 1-NN classification task, accord¬ 
ing to the leave one out error rate (top EN documents, bottom FR documents). 
Clg (black plain curve), CvA\ (blue dashdotted curve), CvAi (red dotted curve) 
measures are given as a function of the mixing parameter a. 
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Figure 4: Comparability/similarity mixing effect on the 1-NN classification task, according 
to 10-fold cross validation error rate (top EN documents, bottom FR documents). 
Clg (black plain curve), CvAi (blue triangle dashdotted curve), CvA\ (red star 
dotted curve) measures are given as a function of the mixing parameter a. 
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k-medoids tf_21 classes 



k-medoids tf_21 classes 



Figure 5: Evaluation of the comparability/similarity mixing on the k-medoids clustering 
accuracy (AC) (top EN documents, bottom FR documents). Clg (black plain 
curve), CvAi (blue triangle dashdotted curve), CvAi (red star dotted curve) mea¬ 
sures are given as a function of the mixing parameter a. 
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k-medoids tf_21 classes 



k-medoids tf_21 classes 



Figure 6: Evaluation of the mixing of comparability and similarity measures on the k- 
medoids clustering according to the NMI measure (top EN documents, bottom 
FR documents). Clg (black plain curve), CyA^ (blue triangle dashdotted curve), 
CvA\ (red star dotted curve) measures are given as a function of the mixing 
parameter a. 
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Figure 7: Comparability/similarity mixing effect on a k-medoids clustering according to 
the DB measure (top EN documents, bottom FR documents). Clg (black plain 
curve), CvAi (blue triangle dashdotted curve), CyAi (I'ed star dotted curve) mea¬ 
sures are given as a function of the mixing parameter a. 


18 






































Exploiting a comparability mapping to improve bi-lingual data categorization 


8. Analysis and conclusions 

In this paper, we have proposed a new approach for the co-clustering and co-categorization 
of bi-lingual data when a comparability mapping exists. This approach, that could be 
characterized as a kind of three-mode clustering or categorization, is based on the concept 
of similarity induced by a comparability bipartite graph. The three-mode data analysis 
scheme is implemented as a mixing model used to merge native and induced similarity 
measures inside each of the two linguistic space. The assessment of this mixing model on 
purely synthetic random data is quite informative and demonstrates the noise reduction 
capability of the method. 

On real bilingual textual data, the approach involves a quantitative comparability mea¬ 
sure that is based on the exploitation of a bilingual dictionary. To this end, two variants of 
the comparability measure proposed by (Li Gaussier, 2010) have been proposed to adapt 
this measure to clustering and categorization tasks. The implementation of our model on 
semi-manually constructed comparable corpora collected from the Wikipedia resource shows 
to be quite effective. Our detailed experimentation shows that the mixing of native similarity 
measures with a quantitative comparability measure has a clear impact on the classification 
and clustering accuracies. It is noticeable that the improvement is more important in the 
French linguistic space comparatively to the English linguistic space. Furthermore, our ap¬ 
proach works specifically well for the CvAi and CvA 2 comparability variant measures with 
stable and robust classification or clustering result improvements. 

It nevertheless has a small positive impact when the Clg measure is used, leading to 
conclude that taking into account of the frequency of occurrence of lexical entries and fre¬ 
quencies of their translations into the comparability measure design is of crucial importance 
for thematic classification or clustering of bilingual English/French documents. One poten¬ 
tial explanation is that these frequencies of occurrence pair well with the tf-idf heuristic that 
takes place in native cosine similarity. Moreover, according to our results, the choice of the 
value of the mixing parameter a is quite important. A relatively high a value (between 0.5 
and 0.8), that slightly favors the induced similarity measures comparatively to the native 
similarity, will be a good choice in general. Finally our experimentation shows that the 
CvA 2 : whose symmetrization is homogeneous to an arithmetic mean, is more robust than 
CvAi I a result that needs to be consolidated on other independent experiments. 

In terms of perspective, ensuring the scalability and generalizing the approach and ex¬ 
perimentation are major prospects to help constructing thematic comparable corpora on 
demand. 

The bilingual dictionary is a particularly important resource in our approach, since the 
quality of the comparability mapping linking the two linguistic spaces directly relies on it. 
The impact of the coverage of the dictionary relatively to the corpus has been partly studied 
in (Li & Gaussier, 2010) and (Ke et ah, 2014). In the context of comparable thematic data 
processing, it is likely that the enrichment of a general bilingual resource by introducing 
domain specific terminology entries would bring some benefit. 

Finally, another perspective is to expand it to various pairing of languages for which 
bilingual resources are available, in particular bilingual dictionaries. 
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